OCR exploration server

Warning: this site is under development.
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

OCR performance prediction using cross-OCR alignment

Internal identifier: 000019 (Main/Exploration); previous: 000018; next: 000020

Authors: Ahmed Ben Salah [France]; Jean-Philippe Moreux [France]; Nicolas Ragot [France]; Thierry Paquet [France]

RBID : Hal:hal-01191701

Abstract

Since 2006, the National Library of France (BnF) has carried out many mass digitization projects on its collections. Digital documents on Gallica (the BnF's digital library) are indexed through their textual content, which is obtained from service providers that use Optical Character Recognition (OCR) software. Modern OCR technology performs well on modern documents with uniform layouts and known fonts. For old documents, however, OCR results are of lower quality. Assessing OCR quality is therefore a real challenge for the BnF. On the one hand, because OCR processing is organized as a sequential pipeline, identifying the sources of OCR errors is intractable. On the other hand, OCR outputs report no quality information other than word confidence. In this paper, we present a study on OCR performance estimation that aims to control the quality of the word transcriptions produced by OCR. This quality assessment has to operate without any comparison to ground-truth data. To this end, our methodology relies on cross-alignment of the OCR results with those of a secondary OCR, called the reference OCR. This secondary OCR provides uncertain but useful information that is used as an uncertain ground truth. OCR performance is estimated using support vector regression. This predictor uses global features computed on the cross-alignment results. The reported experiments show that our estimate describes the quality of OCR outputs more faithfully than the average word confidence scores computed by the OCR. The proposed methodology can easily be adapted to various corpora by tuning the system on a training dataset of documents with properties similar to those of the documents to be processed.
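
The cross-alignment idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the alignment algorithm or the feature set, so the use of difflib.SequenceMatcher, the function name cross_alignment_features, and the four rate features are all assumptions made for illustration. The sketch aligns the word sequence of a primary OCR with that of a reference OCR and derives a few global agreement features of the kind that could be fed to a regressor; the support vector regression stage itself is not shown.

```python
from difflib import SequenceMatcher


def cross_alignment_features(primary_words, reference_words):
    """Align two OCR word sequences and return global agreement features.

    Hypothetical helper: the feature names below are illustrative
    assumptions, not the features used in the paper.
    """
    matcher = SequenceMatcher(a=primary_words, b=reference_words, autojunk=False)
    counts = {"equal": 0, "replace": 0, "delete": 0, "insert": 0}
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Count each aligned block by the length of its longer side.
        counts[tag] += max(i2 - i1, j2 - j1)
    total = sum(counts.values()) or 1  # avoid division by zero on empty input
    return {
        "agreement_rate": counts["equal"] / total,
        "substitution_rate": counts["replace"] / total,
        "deletion_rate": counts["delete"] / total,
        "insertion_rate": counts["insert"] / total,
    }


# Toy example: each OCR output contains one error, at different positions.
primary = "the qnick brown fox jumps over the lazy dog".split()
reference = "the quick brown fox jumps ovcr the lazy dog".split()
features = cross_alignment_features(primary, reference)
print(features)
```

Note that a disagreement only signals that at least one of the two OCR engines is wrong, which is why the abstract treats the reference OCR as an uncertain ground truth rather than as a correct transcription.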

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">OCR performance prediction using cross-OCR alignment</title>
<author>
<name sortKey="Ben Salah, Ahmed" sort="Ben Salah, Ahmed" uniqKey="Ben Salah A" first="Ahmed" last="Ben Salah">Ahmed Ben Salah</name>
<affiliation wicri:level="1">
<hal:affiliation type="laboratory" xml:id="struct-205116" status="VALID">
<orgName>Bibliothèque nationale de France, Département de la Conservation</orgName>
<orgName type="acronym">BnF_DSC</orgName>
<desc>
<address>
<addrLine>Quai François Mauriac, 75706 Paris cedex 13</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.bnf.fr/fr/professionnels/conservation.html</ref>
</desc>
<listRelation>
<relation active="#struct-300057" type="direct"></relation>
</listRelation>
<tutelles>
<tutelle active="#struct-300057" type="direct">
<org type="institution" xml:id="struct-300057" status="VALID">
<orgName>Bibliothèque Nationale de France</orgName>
<orgName type="acronym">BNF</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
</affiliation>
</author>
<author>
<name sortKey="Moreux, Jean Philippe" sort="Moreux, Jean Philippe" uniqKey="Moreux J" first="Jean-Philippe" last="Moreux">Jean-Philippe Moreux</name>
<affiliation wicri:level="1">
<hal:affiliation type="laboratory" xml:id="struct-205116" status="VALID">
<orgName>Bibliothèque nationale de France, Département de la Conservation</orgName>
<orgName type="acronym">BnF_DSC</orgName>
<desc>
<address>
<addrLine>Quai François Mauriac, 75706 Paris cedex 13</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.bnf.fr/fr/professionnels/conservation.html</ref>
</desc>
<listRelation>
<relation active="#struct-300057" type="direct"></relation>
</listRelation>
<tutelles>
<tutelle active="#struct-300057" type="direct">
<org type="institution" xml:id="struct-300057" status="VALID">
<orgName>Bibliothèque Nationale de France</orgName>
<orgName type="acronym">BNF</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
</affiliation>
</author>
<author>
<name sortKey="Ragot, Nicolas" sort="Ragot, Nicolas" uniqKey="Ragot N" first="Nicolas" last="Ragot">Nicolas Ragot</name>
<affiliation wicri:level="1">
<hal:affiliation type="laboratory" xml:id="struct-204893" status="VALID">
<orgName>Laboratoire d'Informatique de l'Université de Tours</orgName>
<orgName type="acronym">LI</orgName>
<desc>
<address>
<addrLine>64, Avenue Jean Portalis, 37200 Tours</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.li.univ-tours.fr/</ref>
</desc>
<listRelation>
<relation active="#struct-300408" type="direct"></relation>
<relation name="EA6300" active="#struct-300298" type="direct"></relation>
</listRelation>
<tutelles>
<tutelle active="#struct-300408" type="direct">
<org type="institution" xml:id="struct-300408" status="VALID">
<orgName>Polytech'Tours</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
<tutelle name="EA6300" active="#struct-300298" type="direct">
<org type="institution" xml:id="struct-300298" status="VALID">
<orgName>Université François Rabelais - Tours</orgName>
<desc>
<address>
<addrLine>60 rue du Plat d'Étain, 37020 Tours cedex 1 </addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.univ-tours.fr</ref>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
<placeName>
<settlement type="city">Tours</settlement>
<region type="old region" nuts="2">Région Centre</region>
<region type="region" nuts="2">Centre-Val de Loire</region>
</placeName>
<orgName type="university">Université François-Rabelais de Tours</orgName>
<orgName type="institution" wicri:auto="newGroup">Centre Val de Loire Université</orgName>
</affiliation>
</author>
<author>
<name sortKey="Paquet, Thierry" sort="Paquet, Thierry" uniqKey="Paquet T" first="Thierry" last="Paquet">Thierry Paquet</name>
<affiliation wicri:level="1">
<hal:affiliation type="laboratory" xml:id="struct-23832" status="VALID">
<orgName>Laboratoire d'Informatique, de Traitement de l'Information et des Systèmes</orgName>
<orgName type="acronym">LITIS</orgName>
<desc>
<address>
<addrLine>Avenue de l'Université UFR des Sciences et Techniques 76800 Saint-Etienne du Rouvray</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.litislab.eu</ref>
</desc>
<listRelation>
<relation active="#struct-300317" type="direct"></relation>
<relation name="EA4108" active="#struct-300318" type="direct"></relation>
<relation active="#struct-301288" type="direct"></relation>
<relation active="#struct-301232" type="indirect"></relation>
</listRelation>
<tutelles>
<tutelle active="#struct-300317" type="direct">
<org type="institution" xml:id="struct-300317" status="VALID">
<orgName>Université du Havre</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
<tutelle name="EA4108" active="#struct-300318" type="direct">
<org type="institution" xml:id="struct-300318" status="VALID">
<orgName>Université de Rouen</orgName>
<desc>
<address>
<addrLine> 1 rue Thomas Becket - 76821 Mont-Saint-Aignan</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.univ-rouen.fr/</ref>
</desc>
</org>
</tutelle>
<tutelle active="#struct-301288" type="direct">
<org type="department" xml:id="struct-301288" status="VALID">
<orgName>Institut National des Sciences Appliquées - Rouen</orgName>
<orgName type="acronym">INSA Rouen</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
<listRelation>
<relation active="#struct-301232" type="direct"></relation>
</listRelation>
</org>
</tutelle>
<tutelle active="#struct-301232" type="indirect">
<org type="institution" xml:id="struct-301232" status="VALID">
<orgName>Institut National des Sciences Appliquées</orgName>
<orgName type="acronym">INSA</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
<placeName>
<settlement type="city">Rouen</settlement>
<region type="region" nuts="2">Normandie</region>
</placeName>
<orgName type="university">Université de Rouen</orgName>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">HAL</idno>
<idno type="RBID">Hal:hal-01191701</idno>
<idno type="halId">hal-01191701</idno>
<idno type="halUri">https://hal.archives-ouvertes.fr/hal-01191701</idno>
<idno type="url">https://hal.archives-ouvertes.fr/hal-01191701</idno>
<date when="2015-08-23">2015-08-23</date>
<idno type="wicri:Area/Hal/Corpus">000089</idno>
<idno type="wicri:Area/Hal/Curation">000089</idno>
<idno type="wicri:Area/Hal/Checkpoint">000007</idno>
<idno type="wicri:Area/Main/Merge">000019</idno>
<idno type="wicri:Area/Main/Curation">000019</idno>
<idno type="wicri:Area/Main/Exploration">000019</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en">OCR performance prediction using cross-OCR alignment</title>
<author>
<name sortKey="Ben Salah, Ahmed" sort="Ben Salah, Ahmed" uniqKey="Ben Salah A" first="Ahmed" last="Ben Salah">Ahmed Ben Salah</name>
<affiliation wicri:level="1">
<hal:affiliation type="laboratory" xml:id="struct-205116" status="VALID">
<orgName>Bibliothèque nationale de France, Département de la Conservation</orgName>
<orgName type="acronym">BnF_DSC</orgName>
<desc>
<address>
<addrLine>Quai François Mauriac, 75706 Paris cedex 13</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.bnf.fr/fr/professionnels/conservation.html</ref>
</desc>
<listRelation>
<relation active="#struct-300057" type="direct"></relation>
</listRelation>
<tutelles>
<tutelle active="#struct-300057" type="direct">
<org type="institution" xml:id="struct-300057" status="VALID">
<orgName>Bibliothèque Nationale de France</orgName>
<orgName type="acronym">BNF</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
</affiliation>
</author>
<author>
<name sortKey="Moreux, Jean Philippe" sort="Moreux, Jean Philippe" uniqKey="Moreux J" first="Jean-Philippe" last="Moreux">Jean-Philippe Moreux</name>
<affiliation wicri:level="1">
<hal:affiliation type="laboratory" xml:id="struct-205116" status="VALID">
<orgName>Bibliothèque nationale de France, Département de la Conservation</orgName>
<orgName type="acronym">BnF_DSC</orgName>
<desc>
<address>
<addrLine>Quai François Mauriac, 75706 Paris cedex 13</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.bnf.fr/fr/professionnels/conservation.html</ref>
</desc>
<listRelation>
<relation active="#struct-300057" type="direct"></relation>
</listRelation>
<tutelles>
<tutelle active="#struct-300057" type="direct">
<org type="institution" xml:id="struct-300057" status="VALID">
<orgName>Bibliothèque Nationale de France</orgName>
<orgName type="acronym">BNF</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
</affiliation>
</author>
<author>
<name sortKey="Ragot, Nicolas" sort="Ragot, Nicolas" uniqKey="Ragot N" first="Nicolas" last="Ragot">Nicolas Ragot</name>
<affiliation wicri:level="1">
<hal:affiliation type="laboratory" xml:id="struct-204893" status="VALID">
<orgName>Laboratoire d'Informatique de l'Université de Tours</orgName>
<orgName type="acronym">LI</orgName>
<desc>
<address>
<addrLine>64, Avenue Jean Portalis, 37200 Tours</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.li.univ-tours.fr/</ref>
</desc>
<listRelation>
<relation active="#struct-300408" type="direct"></relation>
<relation name="EA6300" active="#struct-300298" type="direct"></relation>
</listRelation>
<tutelles>
<tutelle active="#struct-300408" type="direct">
<org type="institution" xml:id="struct-300408" status="VALID">
<orgName>Polytech'Tours</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
<tutelle name="EA6300" active="#struct-300298" type="direct">
<org type="institution" xml:id="struct-300298" status="VALID">
<orgName>Université François Rabelais - Tours</orgName>
<desc>
<address>
<addrLine>60 rue du Plat d'Étain, 37020 Tours cedex 1 </addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.univ-tours.fr</ref>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
<placeName>
<settlement type="city">Tours</settlement>
<region type="old region" nuts="2">Région Centre</region>
<region type="region" nuts="2">Centre-Val de Loire</region>
</placeName>
<orgName type="university">Université François-Rabelais de Tours</orgName>
<orgName type="institution" wicri:auto="newGroup">Centre Val de Loire Université</orgName>
</affiliation>
</author>
<author>
<name sortKey="Paquet, Thierry" sort="Paquet, Thierry" uniqKey="Paquet T" first="Thierry" last="Paquet">Thierry Paquet</name>
<affiliation wicri:level="1">
<hal:affiliation type="laboratory" xml:id="struct-23832" status="VALID">
<orgName>Laboratoire d'Informatique, de Traitement de l'Information et des Systèmes</orgName>
<orgName type="acronym">LITIS</orgName>
<desc>
<address>
<addrLine>Avenue de l'Université UFR des Sciences et Techniques 76800 Saint-Etienne du Rouvray</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.litislab.eu</ref>
</desc>
<listRelation>
<relation active="#struct-300317" type="direct"></relation>
<relation name="EA4108" active="#struct-300318" type="direct"></relation>
<relation active="#struct-301288" type="direct"></relation>
<relation active="#struct-301232" type="indirect"></relation>
</listRelation>
<tutelles>
<tutelle active="#struct-300317" type="direct">
<org type="institution" xml:id="struct-300317" status="VALID">
<orgName>Université du Havre</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
<tutelle name="EA4108" active="#struct-300318" type="direct">
<org type="institution" xml:id="struct-300318" status="VALID">
<orgName>Université de Rouen</orgName>
<desc>
<address>
<addrLine> 1 rue Thomas Becket - 76821 Mont-Saint-Aignan</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.univ-rouen.fr/</ref>
</desc>
</org>
</tutelle>
<tutelle active="#struct-301288" type="direct">
<org type="department" xml:id="struct-301288" status="VALID">
<orgName>Institut National des Sciences Appliquées - Rouen</orgName>
<orgName type="acronym">INSA Rouen</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
<listRelation>
<relation active="#struct-301232" type="direct"></relation>
</listRelation>
</org>
</tutelle>
<tutelle active="#struct-301232" type="indirect">
<org type="institution" xml:id="struct-301232" status="VALID">
<orgName>Institut National des Sciences Appliquées</orgName>
<orgName type="acronym">INSA</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
<placeName>
<settlement type="city">Rouen</settlement>
<region type="region" nuts="2">Normandie</region>
</placeName>
<orgName type="university">Université de Rouen</orgName>
</affiliation>
</author>
</analytic>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass>
<keywords scheme="mix" xml:lang="en">
<term>Historical documents digitization</term>
<term>Mass digitization projects</term>
<term>OCR quality assessment</term>
<term>Support Vector Regression (SVR)</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">Since 2006, the National Library of France (BnF) has carried out many mass digitization projects on its collections. Digital documents on Gallica (the BnF's digital library) are indexed through their textual content, which is obtained from service providers that use Optical Character Recognition (OCR) software. Modern OCR technology performs well on modern documents with uniform layouts and known fonts. For old documents, however, OCR results are of lower quality. Assessing OCR quality is therefore a real challenge for the BnF. On the one hand, because OCR processing is organized as a sequential pipeline, identifying the sources of OCR errors is intractable. On the other hand, OCR outputs report no quality information other than word confidence. In this paper, we present a study on OCR performance estimation that aims to control the quality of the word transcriptions produced by OCR. This quality assessment has to operate without any comparison to ground-truth data. To this end, our methodology relies on cross-alignment of the OCR results with those of a secondary OCR, called the reference OCR. This secondary OCR provides uncertain but useful information that is used as an uncertain ground truth. OCR performance is estimated using support vector regression. This predictor uses global features computed on the cross-alignment results. The reported experiments show that our estimate describes the quality of OCR outputs more faithfully than the average word confidence scores computed by the OCR. The proposed methodology can easily be adapted to various corpora by tuning the system on a training dataset of documents with properties similar to those of the documents to be processed.</div>
</front>
</TEI>
<affiliations>
<list>
<country>
<li>France</li>
</country>
<region>
<li>Centre-Val de Loire</li>
<li>Normandie</li>
<li>Région Centre</li>
</region>
<settlement>
<li>Rouen</li>
<li>Tours</li>
</settlement>
<orgName>
<li>Centre Val de Loire Université</li>
<li>Université François-Rabelais de Tours</li>
<li>Université de Rouen</li>
</orgName>
</list>
<tree>
<country name="France">
<noRegion>
<name sortKey="Ben Salah, Ahmed" sort="Ben Salah, Ahmed" uniqKey="Ben Salah A" first="Ahmed" last="Ben Salah">Ahmed Ben Salah</name>
</noRegion>
<name sortKey="Moreux, Jean Philippe" sort="Moreux, Jean Philippe" uniqKey="Moreux J" first="Jean-Philippe" last="Moreux">Jean-Philippe Moreux</name>
<name sortKey="Paquet, Thierry" sort="Paquet, Thierry" uniqKey="Paquet T" first="Thierry" last="Paquet">Thierry Paquet</name>
<name sortKey="Ragot, Nicolas" sort="Ragot, Nicolas" uniqKey="Ragot N" first="Nicolas" last="Ragot">Nicolas Ragot</name>
</country>
</tree>
</affiliations>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000019 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 000019 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    Main
   |étape=   Exploration
   |type=    RBID
   |clé=     Hal:hal-01191701
   |texte=   OCR performance prediction using cross-OCR alignment
}}

Wicri

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024